Keyword Search Result

[Keyword] speaker adaptation(27hit)

21-27hit(27hit)

  • Speaker Adaptation Based on a Maximum Observation Probability Criterion

    Tae-Young YANG  Chungyong LEE  Dae-Hee YOUN  

     
    LETTER-Speech and Hearing

    Vol: E84-D  No: 2  Page(s): 286-288

    A speaker adaptation technique that maximizes the observation probability of input speech is proposed. It is applied to semi-continuous hidden Markov model (SCHMM) speech recognizers. The proposed algorithm adapts the mean µ and the covariance Σ iteratively by a gradient search technique so that the features of the adaptation speech data achieve maximum observation probabilities. The mixture coefficients and the state transition probabilities are adapted by a model interpolation scheme. The main advantage of this scheme is that the means and the variances, which are common to all states in an SCHMM, are adapted independently of the other SCHMM parameters. This allows fast and precise adaptation, especially when there is a large acoustic mismatch between the reference model and a new speaker. The scheme could also be applied to other areas that use a codebook. The proposed adaptation algorithm was evaluated with male speaker-dependent, female speaker-dependent, and speaker-independent recognizers. Experimental results on isolated word recognition showed that the proposed adaptation algorithm achieved a 46.03% average enhancement in the male speaker-dependent recognizer, 52.18% in the female speaker-dependent recognizer, and 9.84% in the speaker-independent recognizer.
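
    The idea of adapting a mean and a variance by gradient search so that adaptation data become more probable can be sketched as follows. This is only an illustration on a single one-dimensional Gaussian, not the paper's SCHMM codebook update, which the abstract does not give. The function names, the learning rate, the step count, and the variance floor are all assumptions.

```python
import math

def log_likelihood(data, mean, var):
    """Log observation probability of the data under one 1-D Gaussian."""
    return sum(-0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)
               for x in data)

def adapt_gaussian(data, mean, var, lr=0.05, steps=500, var_floor=1e-3):
    """Gradient ascent on the average log-likelihood (hypothetical settings)."""
    n = len(data)
    for _ in range(steps):
        # Partial derivatives of the average log-likelihood.
        grad_mean = sum((x - mean) / var for x in data) / n
        grad_var = sum(((x - mean) ** 2 - var) / (2 * var ** 2) for x in data) / n
        mean += lr * grad_mean
        var = max(var_floor, var + lr * grad_var)  # keep the variance positive
    return mean, var

if __name__ == "__main__":
    speech = [2.0, 2.5, 3.0, 3.5]  # toy adaptation features
    print(adapt_gaussian(speech, mean=0.0, var=1.0))
```

    Starting from a mismatched model (mean 0, variance 1), the parameters move toward values that make the adaptation data more likely.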

  • Discriminative Training Based on Minimum Classification Error for a Small Amount of Data Enhanced by Vector-Field-Smoothed Bayesian Learning

    Jun-ichi TAKAHASHI  Shigeki SAGAYAMA  

     
    PAPER-Speech Processing and Acoustics

    Vol: E79-D  No: 12  Page(s): 1700-1707

    This paper describes how to use discriminative training based on the Minimum Classification Error (MCE) criterion effectively with a small amount of data in order to attain the highest level of recognition performance. The method combines MCE training with Vector-Field-Smoothed Bayesian learning, called MAP/VFS, which in turn combines maximum a posteriori (MAP) estimation with Vector Field Smoothing (VFS). In the proposed method, MAP/VFS significantly enhances the robustness of acoustic modeling in MCE training. In model training, MCE training is performed using the MAP/VFS-trained model as the initial model. The same data are used in both training stages. For speaker adaptation using several dozen training words, the proposed method has been shown experimentally to be very effective. With 50 training words, recognition errors are reduced by 47%, compared with 16.5% when using MCE alone. Of this 47% reduction, 39% is due to MAP, an additional 4% to VFS, and a further 4% to MCE; it is attained by using MAP/VFS to enhance the capability of MCE training.
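
    The MAP step mentioned above can be illustrated with the textbook MAP estimate of a Gaussian mean, which interpolates between a prior mean and the sample average. This is a sketch under stated assumptions: the abstract does not give the estimation formulas used in the paper, and the function name and the prior weight `tau` are hypothetical.

```python
def map_mean(prior_mean, samples, tau=10.0):
    """Textbook MAP estimate of a Gaussian mean vector.

    prior_mean: mean of the initial (e.g. speaker-independent) model
    samples:    adaptation feature vectors assigned to this Gaussian
    tau:        assumed prior weight; larger values trust the prior more
    """
    n = len(samples)
    dim = len(prior_mean)
    sums = [sum(s[j] for s in samples) for j in range(dim)]
    return [(tau * prior_mean[j] + sums[j]) / (tau + n) for j in range(dim)]

if __name__ == "__main__":
    # With few samples the estimate stays close to the prior mean.
    print(map_mean([0.0, 0.0], [[2.0, 1.0], [2.0, 1.0]], tau=10.0))
```

    With no adaptation samples the estimate equals the prior mean, which is why this kind of estimate behaves well when data are scarce.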

  • Unsupervised Speaker Adaptation Using All-Phoneme Ergodic Hidden Markov Network

    Yasunage MIYAZAWA  Jun-ichi TAKAMI  Shigeki SAGAYAMA  Shoichi MATSUNAGA  

     
    PAPER-Speech Processing and Acoustics

    Vol: E78-D  No: 8  Page(s): 1044-1050

    This paper proposes an unsupervised speaker adaptation method using an "all-phoneme ergodic Hidden Markov Network" that combines allophonic (context-dependent phone) acoustic models with stochastic language constraints. A Hidden Markov Network (HMnet) for allophone modeling and allophonic bigram probabilities derived from a large text database are combined to yield a single large ergodic HMM that represents arbitrary speech signals in a particular language, so that the model parameters can be re-estimated with the Baum-Welch algorithm using speech samples whose text is unknown. When combined with the Vector Field Smoothing (VFS) technique, unsupervised speaker adaptation can be performed effectively. In experiments, this method performed better than our previous unsupervised adaptation method, which used conventional phonetic HMMs and phoneme bigram probabilities, especially when the amount of training data was small.

  • Speaker-Consistent Parsing for Speaker-Independent Continuous Speech Recognition

    Kouichi YAMAGUCHI  Harald SINGER  Shoichi MATSUNAGA  Shigeki SAGAYAMA  

     
    PAPER

    Vol: E78-D  No: 6  Page(s): 719-724

    This paper describes a novel speaker-independent speech recognition method, called "speaker-consistent parsing", which is based on an intra-speaker correlation called the speaker-consistency principle. We focus on the fact that a sentence or a string of words is uttered by an individual speaker even in a speaker-independent task. Thus, the proposed method searches through speaker variations in addition to the contents of utterances. As a result of the recognition process, an appropriate standard speaker is selected for speaker adaptation. This new method is experimentally compared with a conventional speaker-independent speech recognition method. Since the speaker-consistency principle best demonstrates its effect with a large number of training and test speakers, a small-scale experiment may not fully exploit this principle. Nevertheless, even the results of our small-scale experiment show that the new method significantly outperforms the conventional method. In addition, this framework's speaker selection mechanism can drastically reduce the likelihood map computation.

  • Automatic Evaluation of English Pronunciation Based on Speech Recognition Techniques

    Hiroshi HAMADA  Satoshi MIKI  Ryohei NAKATSU  

     
    PAPER-Speech Processing

    Vol: E76-D  No: 3  Page(s): 352-359

    A new method is proposed for automatically evaluating the English pronunciation quality of non-native speakers. It is assumed that pronunciation can be rated using three criteria: the static characteristics of phonetic spectra, the dynamic structure of spectrum sequences, and the prosodic characteristics of utterances. The evaluation uses speech recognition techniques to compare the English words pronounced by a non-native speaker with those pronounced by a native speaker. Three evaluation measures are proposed to rate pronunciation quality. (1) The standard deviation of the mapping vectors, which map the codebook vectors of the non-native speaker onto the vector space of the native speaker, is used to evaluate the static phonetic spectra characteristics. (2) The spectral distance between words pronounced by the non-native speaker and those pronounced by the native speaker obtained by the DTW method is used to evaluate the dynamic characteristics of spectral sequences. (3) The differences in fundamental frequency and speech power between the pronunciation of the native and non-native speaker are used as the criteria for evaluating prosodic characteristics. Evaluation experiments are carried out using 441 words spoken by 10 Japanese speakers and 10 native speakers. One half of the 441 words was used to evaluate static phonetic spectra characteristics, and the other half was used to evaluate the dynamic characteristics of spectral sequences, as well as the prosodic characteristics. Based on the experimental results, the correlation between the evaluation scores and the scores determined by human judgement is found to be 0.90.
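
    Measure (2) above relies on a spectral distance obtained by DTW. A minimal sketch of a standard dynamic time warping distance between two sequences of feature vectors is given below. The Euclidean local distance and the unconstrained step pattern are assumptions, since the abstract does not state which local distance or path constraints were used.

```python
import math

def dtw_distance(seq_a, seq_b):
    """Accumulated DTW cost between two sequences of feature vectors."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j] = best accumulated cost aligning seq_a[:i] with seq_b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(seq_a[i - 1], seq_b[j - 1])  # Euclidean (assumed)
            cost[i][j] = local + min(cost[i - 1][j],      # frame of seq_b repeated
                                     cost[i][j - 1],      # frame of seq_a repeated
                                     cost[i - 1][j - 1])  # both sequences advance
    return cost[n][m]

if __name__ == "__main__":
    native = [[0, 0], [1, 1], [2, 2]]
    learner = [[0, 0], [1, 1], [1, 1], [2, 2]]  # same frames, spoken more slowly
    print(dtw_distance(native, learner))
```

    Because the warping absorbs differences in speaking rate, a slower rendition of the same frame sequence yields zero distance, so the remaining cost reflects spectral differences.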

  • Speaker Weighted Training of HMM Using Multiple Reference Speakers

    Hiroaki HATTORI  Satoshi NAKAMURA  Kiyohiro SHIKANO  Shigeki SAGAYAMA  

     
    PAPER-Speech Processing

    Vol: E76-D  No: 2  Page(s): 219-226

    This paper proposes a new speaker adaptation method using a speaker weighting technique for multiple reference speaker training of a hidden Markov model (HMM). The proposed method considers the similarities between an input speaker and multiple reference speakers, and uses the similarities to control the influence of the reference speakers on the HMM. The evaluation experiments were carried out on the /b, d, g, m, n, N/ phoneme recognition task using 8 speakers. Average recognition rates were 68.0%, 66.4%, and 65.6%, respectively, for three test sets with different speech styles. These were 4.8%, 8.8%, and 10.5% higher than the rates of the spectrum mapping method, and also 1.6%, 6.7%, and 8.2% higher than the rates of the multiple reference speaker training, the supplemented HMM. The evaluation experiments clarified the effectiveness of the proposed method.
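
    The general idea of letting similar reference speakers contribute more can be sketched as follows. This is not the paper's formulation: the abstract does not specify the similarity measure or how the weights enter HMM training. The exponential distance weighting, the `temperature` parameter, and both function names are assumptions made for illustration.

```python
import math

def speaker_weights(input_vec, reference_vecs, temperature=1.0):
    """Turn distances to reference speakers into weights that sum to 1."""
    dists = [math.dist(input_vec, ref) for ref in reference_vecs]
    scores = [math.exp(-d / temperature) for d in dists]  # closer => larger
    total = sum(scores)
    return [s / total for s in scores]

def weighted_mean(reference_means, weights):
    """Combine per-speaker mean vectors using the speaker weights."""
    dim = len(reference_means[0])
    return [sum(w * mean[j] for w, mean in zip(weights, reference_means))
            for j in range(dim)]

if __name__ == "__main__":
    refs = [[0.0, 1.0], [0.0, 3.0]]  # toy representative vectors per speaker
    w = speaker_weights([0.0, 0.0], refs)
    print(w, weighted_mean(refs, w))
```

    The reference speaker closest to the input speaker receives the largest weight and therefore dominates the combined parameters.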

  • Speaker Adaptation Based on Vector Field Smoothing

    Hiroaki HATTORI  Shigeki SAGAYAMA  

     
    PAPER-Speech Processing

    Vol: E76-D  No: 2  Page(s): 227-234

    This paper describes a new supervised speaker adaptation method based on vector field smoothing for small amounts of adaptation data. This method assumes that the correspondence of feature vectors between speakers can be viewed as a kind of smooth vector field, and introduces interpolation and smoothing of the correspondence into the adaptation process to achieve higher adaptation performance with little data. The proposed adaptation method was applied to discrete HMM based speech recognition and evaluated in Japanese phoneme and phrase recognition experiments. Using 10 words as the adaptation data, the proposed method produced almost the same results as the conventional codebook mapping method with 25 words. These experiments clearly confirmed the effectiveness of the proposed method.
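
    The interpolation step described above can be sketched by treating the speaker-to-speaker correspondence as transfer vectors known at a few points and estimating the transfer vector elsewhere from its nearest neighbors. The inverse-distance weighting, the neighbor count `k`, and the function names are assumptions. The abstract does not give the paper's actual interpolation or smoothing formulas.

```python
import math

def interpolate_transfer(point, anchors, transfers, k=2, eps=1e-9):
    """Estimate the transfer vector at `point` from the k nearest anchors.

    anchors:   vectors whose transfer vectors were learned from adaptation data
    transfers: the corresponding transfer vectors (new speaker minus reference)
    """
    nearest = sorted((math.dist(point, a), i) for i, a in enumerate(anchors))[:k]
    weights = [1.0 / (d + eps) for d, _ in nearest]  # inverse-distance (assumed)
    total = sum(weights)
    dim = len(point)
    return [sum(w * transfers[i][j] for w, (_, i) in zip(weights, nearest)) / total
            for j in range(dim)]

def adapt_vector(point, anchors, transfers, k=2):
    """Move a vector with no adaptation data along the interpolated field."""
    shift = interpolate_transfer(point, anchors, transfers, k)
    return [p + s for p, s in zip(point, shift)]

if __name__ == "__main__":
    anchors = [[0.0, 0.0], [2.0, 0.0]]
    transfers = [[1.0, 0.0], [3.0, 0.0]]
    print(adapt_vector([1.0, 0.0], anchors, transfers))
```

    Vectors that never appeared in the adaptation data still receive a plausible shift, which is how a smooth field can compensate for small adaptation sets.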
